Gpu improvements #66
Conversation
message("GPU_TYPE is ${GPU_TYPE} GPU_OFFLOAD is ${GPU_OFFLOAD}") | ||
if (GPU_TYPE STREQUAL v100 AND GPU_OFFLOAD STREQUAL OpenACC) | ||
string(APPEND GPUFLAGS " -acc -gpu=cc70,lineinfo,nofma -Minfo=accel ") | ||
endif() |
Can we issue an error message and abort compilation for any configuration other than V100 GPU with OpenACC? That way we won't accidentally compile the code with undesired flags.
Also, we could add the A100 GPU and OpenACC configuration here with the flags -acc -gpu=cc80,lineinfo,nofma -Minfo=accel.
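For reference, such a block would mirror the existing v100 branch (a sketch; the same block in fact shows up in a later revision of nvhpc.cmake in this PR):

if (GPU_TYPE STREQUAL a100 AND GPU_OFFLOAD STREQUAL OpenACC)
  string(APPEND GPUFLAGS " -acc -gpu=cc80,lineinfo,nofma -Minfo=accel ")
endif()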
I disagree with adding an error message and exiting: we want users to be able to expand the list of supported compilers, machines, and offload options. I can look into adding a warning instead.
Yes, we can add the a100 options.
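A minimal sketch of what that warning could look like after the GPU flag selection (hypothetical; it assumes GPUFLAGS is still empty when no branch matched):

if (GPU_TYPE AND NOT GPUFLAGS)
  message(WARNING "No GPU flags defined for GPU_TYPE=${GPU_TYPE} with GPU_OFFLOAD=${GPU_OFFLOAD}; building without GPU offload flags.")
endif()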
My concern is that if we specify, say, V100 GPU with OpenMP offload here, the code will still compile but no GPU flags will actually be applied, which is confusing. Adding an error message would force the user to supply the correct options before the code can be compiled. What do you think?
I think that I can enforce this by making the MAX_GPUS_PER_NODE variable in config_machines.xml compiler dependent. How does that sound?
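A sketch of what compiler-dependent entries might look like, using the attribute-selection convention config_machines.xml already uses elsewhere (the counts here are placeholders):

<MAX_GPUS_PER_NODE compiler="nvhpc">4</MAX_GPUS_PER_NODE>
<MAX_GPUS_PER_NODE compiler="intel">0</MAX_GPUS_PER_NODE>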
Oh, I did not realize that you had provided the full list of combinations for OpenACC and OpenMP offload. In that case my concern is unnecessary.
machines/config_batch.xml (outdated)
 </directives>

-<directives queue="casper" compiler="nvhpc-gpu">
+<directives queue="casper" compiler="nvhpc">
   <!-- Turn on MPS server manually -->
   <!-- This is a temporary solution and should be removed once MPS is integrated into PBS on Casper -->
   <directive default="/bin/bash" > -S /glade/u/apps/dav/opt/nvidia-mps/mps_bash </directive>
CISL has integrated MPS into the PBS scheduler, so we can turn on MPS by adding mps=1 to the PBS resource request -l select=... on the next line. This temporary solution is no longer needed.
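For illustration, the select line might then look something like this (node, CPU, and GPU counts are placeholders):

-l select=1:ncpus=36:mpiprocs=36:ngpus=4:mps=1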
Can mps=1 be added to the nvhpc select line regardless of whether GPUs are actually used, or should it only be there for GPU cases?
Great question, and I do not know. For safety, can we add the ngpus and mps options only when the GPU_OFFLOAD method is not none?
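One way to express that, sketched with the gpu_enabled directive attribute that appears later in this PR (resource counts are placeholders):

<directives queue="casper" compiler="nvhpc" gpu_enabled="true">
  <!-- request GPUs and MPS only for GPU-enabled cases -->
  <directive> -l select=1:ncpus=36:mpiprocs=36:ngpus=4:mps=1 </directive>
</directives>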
And we have the ngpus_per_node option for the create_newcase script. That could also be used to determine whether a job should be submitted to a GPU queue, and which PBS resources should be selected later.
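For reference, a hypothetical invocation (the exact flag spelling exposed by create_newcase may differ; the case name, grid, and compset here are placeholders):

./create_newcase --case mygpucase --res f19_g17 --compset X --machine casper --ngpus-per-node 4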
machines/config_machines.xml (outdated)
@@ -478,15 +479,17 @@ This allows using a different mpirun command to launch unit tests
  <command name="load">openmpi/4.1.0</command>
  <command name="load">netcdf-mpi/4.7.4</command>
  <command name="load">pnetcdf/1.12.2</command>
</modules>
<modules mpilib="openmpi" compiler="nvhpc" gpu_offload="OpenACC">
  <command name="load">cuda/11.0.3</command>
I later realized that the nvhpc module already comes with CUDA, so we probably do not need to load CUDA manually unless the newer CUDA version has a problem. May be worth testing.
Already tested: on Casper the cuda module seems to be required even though it shouldn't be.
Aha, I see. In that case, can we upgrade the CUDA version to 11.6, since that is the version that comes with nvhpc/22.2?
machines/config_machines.xml (outdated)
<command name="load">netcdf-mpi/4.8.1</command> | ||
<command name="load">pnetcdf/1.12.2</command> | ||
<command name="load">pnetcdf/1.12.3</command> | ||
</modules> | ||
<modules mpilib="mpi-serial" compiler="nvhpc"> | ||
<command name="load">netcdf/4.8.1</command> |
Shall we delete lines 497 to 505?
machines/config_machines.xml (outdated)
<modules compiler="nvhpc" mpilib="openmpi" DEBUG="FALSE"> | ||
<command name="use">/glade/p/cesmdata/cseg/PROGS/modulefiles/esmfpkgs/nvhpc/22.2/</command> | ||
<command name="load">esmf-8.4.0b08_casper-ncdfio-openmpi-O</command> | ||
</modules> | ||
<modules compiler="pgi" mpilib="openmpi" DEBUG="TRUE"> |
Shall we delete the PGI-related settings for Casper now?
@@ -417,6 +417,7 @@ This allows using a different mpirun command to launch unit tests
 <DIN_LOC_ROOT_CLMFORC>/glade/p/cgd/tss/CTSM_datm_forcing_data</DIN_LOC_ROOT_CLMFORC>
 <DOUT_S_ROOT>$CIME_OUTPUT_ROOT/archive/$CASE</DOUT_S_ROOT>
 <BASELINE_ROOT>$ENV{CESMDATAROOT}/cesm_baselines</BASELINE_ROOT>
+<CCSM_CPRNC>$ENV{CESMDATAROOT}/tools/cime/tools/cprnc/cprnc</CCSM_CPRNC>
Remove pgi and *-gpu at line 413.
I know I haven't completed the cleanup here yet.
…p and combined options
machines/cmake_macros/nvhpc.cmake (outdated)
if (GPU_TYPE STREQUAL a100 AND GPU_OFFLOAD STREQUAL OpenACC)
  string(APPEND GPUFLAGS " -acc -gpu=cc80,lineinfo,nofma -Minfo=accel ")
endif()
if (GPU_TYPE STREQUAL a100 AND GPU_OFFLOAD STREQUAL OpenACC)
Replace OpenACC with OpenMP.
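That is, the duplicated block should presumably become something like this (the -mp=gpu OpenMP offload flag is an assumption based on nvhpc conventions; the review comment itself only fixes the condition):

if (GPU_TYPE STREQUAL a100 AND GPU_OFFLOAD STREQUAL OpenMP)
  string(APPEND GPUFLAGS " -mp=gpu -gpu=cc80,lineinfo,nofma -Minfo=mp ")
endif()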
Use nvhpc/22.11, cray-mpich/8.1.21, and esmf/8.4.1.b02
  modified: machines/cmake_macros/gust.cmake
  modified: machines/config_machines.xml
…_config_cesm/compare/52c06b3..fbc05d6 Update settings on Gust and Casper
  modified: Depends.nvhpc
  deleted: Depends.nvhpc-gpu
  deleted: cmake_macros/nvhpc-gpu.cmake
  deleted: cmake_macros/nvhpc-gpu_casper.cmake
  modified: cmake_macros/nvhpc.cmake
  modified: cmake_macros/nvhpc_casper.cmake
  deleted: cmake_macros/pgi-gpu.cmake
  deleted: cmake_macros/pgi-gpu_casper.cmake
  modified: config_batch.xml
  modified: config_machines.xml
  deleted: mpi_run_gpu.casper
…asper and Gust
  modified: machines/config_machines.xml
modified: config_batch.xml
modified: machines/config_machines.xml
modified: machines/config_machines.xml
modified: machines/config_machines.xml
modified: machines/config_batch.xml
load cuda module on Gust for a GPU run
  deleted: cmake_macros/nvhpc_gust.cmake
  modified: config_machines.xml
modified: machines/config_machines.xml
modified: config_batch.xml
modified: machines/config_batch.xml
modified: machines/config_machines.xml
modified: config_machines.xml
modified: machines/config_machines.xml
Thanks @jedwards4b for issuing this PR. It generally looks good to me, but I have some clarifications / change requests before approving it.
machines/config_machines.xml (outdated)
<modules mpilib="openmpi" compiler="nvhpc"> | ||
<command name="load">cuda/11.6</command> |
Shall we only load the CUDA module when gpu_offload is not None? Or skip it when gpu_enabled is False?
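Mirroring the gpu_offload-qualified modules block that appears earlier in this file, a sketch could be (one such block per supported offload method):

<modules mpilib="openmpi" compiler="nvhpc" gpu_offload="OpenACC">
  <command name="load">cuda/11.6</command>
</modules>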
<MAX_MPITASKS_PER_NODE>36</MAX_MPITASKS_PER_NODE>
<MAX_CPUTASKS_PER_GPU_NODE>36</MAX_CPUTASKS_PER_GPU_NODE>
<GPU_TYPES>v100,a100</GPU_TYPES>
Shall we specify a gpu_offload variable as well? For machines with AMD GPUs, openacc and combined may not be valid inputs.
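For illustration only, such an entry might follow the pattern of the existing GPU_TYPES line (GPU_OFFLOADS is a hypothetical variable name, not something defined in this PR):

<GPU_TYPES>v100,a100</GPU_TYPES>
<GPU_OFFLOADS>openacc,openmp,combined</GPU_OFFLOADS>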
How do we handle Fortran do concurrent here?
This is a great question! In this case, combined may be a confusing word, too. Is there any plan to introduce do concurrent to CAM in the future?
<MAX_MPITASKS_PER_NODE>128</MAX_MPITASKS_PER_NODE>
<MAX_CPUTASKS_PER_GPU_NODE>64</MAX_CPUTASKS_PER_GPU_NODE>
<GPU_TYPES>a100</GPU_TYPES>
The same question about the gpu_offload variable.
<MAX_MPITASKS_PER_NODE>128</MAX_MPITASKS_PER_NODE>
<MAX_CPUTASKS_PER_GPU_NODE>64</MAX_CPUTASKS_PER_GPU_NODE>
<GPU_TYPES>a100</GPU_TYPES>
<PROJECT_REQUIRED>TRUE</PROJECT_REQUIRED>
<mpirun mpilib="default">
  <executable>mpiexec</executable>
At line 1838, shall we do a purge first before loading any modules?
This is a feature of modules on Gust (and soon Derecho): the two modules loaded above the purge are sticky and are not affected by the purge command.
Thanks for the clarification. That is clear to me now.
<submit_args>
  <argument> -l gpu_type=$GPU_TYPE </argument>
</submit_args>
<directives queue="casper" compiler="nvhpc" gpu_enabled="true">
I know this works, but I am not sure how gpu_enabled actually works. Is it an XML variable defined somewhere? And how is it set to True or False during the build? A brief explanation would be very helpful.
This is done here: https://github.com/jedwards4b/cime/blob/add_gpu_gust/CIME/case/case.py#L457
gpu_enabled is an attribute of the case object and is set to true if GPU_TYPE is set to a valid value.
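A rough sketch of that logic (a hypothetical simplification for illustration; the actual implementation is in the case.py link above):

# Hypothetical simplification of how the case object derives its
# gpu_enabled attribute from the GPU_TYPE XML variable.
class Case:
    def __init__(self):
        self.gpu_enabled = False

    def initialize_derived_attributes(self):
        gpu_type = self.get_value("GPU_TYPE")
        # any valid GPU type other than "none" enables GPU support
        if gpu_type is not None and gpu_type != "none":
            self.gpu_enabled = True

    def get_value(self, name):
        # placeholder for the real XML variable lookup
        return None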
Thanks @jedwards4b for the details. That is very helpful!
Remove special definitions for gpu enabled compilers. WIP